Tutorial 3: Cleaning and Wrangling Data
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
- define the term "tidy data"
- discuss the advantages and disadvantages of storing data in a tidy data format
- recall and use the following tidyverse functions and operators for their intended data wrangling tasks:
`select`, `filter`, `|>`, `map`, `mutate`, `summarize`, `group_by`, `pivot_longer`, `pivot_wider`, `%in%`
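As a quick warm-up (not part of any graded question), here is a minimal sketch of a few of these verbs on a made-up data frame; the fruit prices below are invented purely for illustration:

```r
library(dplyr)
library(tidyr)

# A made-up data frame for illustration (not the avocado data set)
fruit <- tibble(
  region = c("A", "A", "B", "B"),
  type   = c("organic", "conventional", "organic", "conventional"),
  price  = c(2.0, 1.5, 1.8, 1.2)
)

# filter + group_by + summarize: average price per region
avg_by_region <- fruit |>
  filter(type %in% c("organic", "conventional")) |>  # %in% tests set membership
  group_by(region) |>
  summarize(avg_price = mean(price))

# pivot_wider: one column per type
wide <- fruit |>
  pivot_wider(names_from = type, values_from = price)
```

Run `avg_by_region` or `wide` in a cell to inspect the results.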
Any place you see ..., you must fill in the function, variable, or data to complete the code. Replace fail() with your completed code and run the cell!
This worksheet covers parts of the Wrangling chapter of the online textbook. You should read this chapter before attempting the worksheet.
### Run this cell before continuing.
library(repr)
library(tidyverse)
source("tests.R")
source("cleanup.R")
options(repr.matrix.max.rows = 6) #limits output of dataframes to 6 rows
Question 0.1
{points: 1}
Match the following definitions with the corresponding functions used in R:
A. Transforms the input by applying a function to each element and returning a vector the same length as the input.
B. Reads files that have columns separated by tabs.
C. Most data operations are done on groups defined by variables. This function takes an existing data set and converts it into a grouped data set where operations are performed "by group".
D. Works in an analogous way to mutate, except instead of adding columns to an existing data frame, it creates a new data frame.
E. "lengthens" data, increasing the number of rows and decreasing the number of columns.
F. Labels the x-axis.
Functions
1. group_by
2. map
3. read_tsv
4. summarise
5. xlab
6. pivot_longer
For every description, create an object using the letter associated with the definition and assign it to the corresponding number from the list of functions. For example:
A <- 1
B <- 2
C <- 3
...
F <- 6
# Replace the fail() with your answer.
# your code here
#fail() # No Answer - remove if you provide an answer
A <- 2
B <- 3
C <- 1
D <- 4
E <- 6
F <- 5
1. Historical Data on Avocado Prices
In the tutorial, we will be finishing off our analysis of the avocado data set.
You might recall from the lecture that millennials LOVE avocado toast. However, avocados are expensive, and this is costing millennials a lot more than you think (joking again 😉, well mostly...). To ensure that they can save enough to buy a house, it would be beneficial for an avocado fanatic to move to a city with low avocado prices. From Worksheet 3 we saw that the price of avocados is lower in the months between December and May, but we still don't know which region has the cheapest avocados.

image source: https://media.giphy.com/media/8p3ylHVA2ZOIo/giphy.gif
As a reminder, here are some relevant columns in the dataset:
- `average_price`: the average price of a single avocado
- `type`: conventional or organic
- `year`: the year
- `region`: the city or region of the observation
- `small_hass_volume`
- `large_hass_volume`
- `extra_l_hass_volume`
Additionally, the last three columns can be used to calculate total_volume in pounds (lbs). The goal for today is to find the region with the cheapest avocados and then produce a plot of the total number of avocados sold against the average price per avocado (in US dollars) for that region. To do this, you will follow the steps below.
- use a tidyverse `read_*` function to load the csv file into your notebook
- use `group_by` + `summarize` to find the region with the cheapest avocados
- use `filter` to specifically look at data from the region of interest
- use `mutate` to add up the volume for all types of avocados (small, large, and extra)
- use `ggplot` to create our plot of volume vs average price
Question 1.1
{points: 1}
Read the file avocado_prices.csv found in the tutorial_03 directory using a relative path.
Assign your answer to an object called avocado.
# your code here
#fail() # No Answer - remove if you provide an answer
avocado <- read_csv("data/avocado_prices.csv")
avocado
Question 1.2
{points: 1}
Now find the region with the cheapest avocados in 2018. To do this, calculate the average price for each region. Your answer should be the row from a data frame with the lowest average price. The data frame you create should have two columns, one named region that has the region, and the other that contains the average price for that region.
Assign your answer to an object called cheapest.
Hint: You can use the function `slice` to provide a row index (or range of row indices) to numerically subset rows from a data frame. For example, to subset the 5th row of a data frame you would type: `dataframe |> slice(5)`
#... <- ... |>
# ... |>
# group_by(...) |>
# summarize(...) |>
# arrange(...) |>
# slice(...)
# your code here
#fail() # No Answer - remove if you provide an answer
cheapest <- avocado |>
filter(yr == 2018) |>
group_by(region) |>
summarize(average_price = mean(average_price, na.rm = TRUE)) |>
arrange(average_price) |>
slice(1)
cheapest
Question 1.3
{points: 1}
Now we will plot the total volume against average price for the cheapest region and all years. First, you need to mutate the data frame such that total_volume is equal to the addition of all three volume columns. Next, filter the dataset using the cheapest region found in Question 1.2. Finally, you will have the data necessary to create a scatter plot with:
- x = `total_volume`
- y = `average_price`
Fill in the ... in the cell below. Copy and paste your finished answer and replace the fail(). We will be scaling the axes - the function is added in the scaffolding for you.
Assign your answer to an object called avocado_plot.
Hint: Do not forget units on your data visualization! Here the price is in US dollars (USD) and the volume in pounds (lbs).
options(repr.plot.width = 8, repr.plot.height = 7)
avocado_plot <- avocado |>
mutate(total_volume = small_hass_volume + large_hass_volume + extra_l_hass_volume ) |>
filter(region == "Houston") |>
ggplot(aes(x = total_volume, y = average_price)) +
geom_point() +
xlab("Total Volume") +
ylab("Average Price") +
scale_x_log10()
# your code here
#fail() # No Answer - remove if you provide an answer
avocado_plot
Question 1.4
What do you notice? Discuss your plot with the person next to you.
To further investigate this trend, let's colour the data points to see if the type of avocado (either organic or not, which is called conventional in this data set) affects the volume and price of avocados sold in our region of interest.
Run the cell below to colour the data points by avocado type.
#run this cell to set plot width/height
#change the numbers below if the plot doesn't fit on your screen and run again!
options(repr.plot.width = 8, repr.plot.height = 6)
# Run this cell to see if avocado type (the type variable) plays a role in production and price.
avocado_plot <- avocado_plot +
geom_point(aes(colour = type)) +
theme(text = element_text(size = 20))
avocado_plot
Question 1.4 (Continued)
{points: 3}
In 2-3 sentences, describe what you see in the graph above. Comment specifically on whether there is any evidence/indication that avocado type might influence price.
Hint: Make sure to include information about volume, average price, and avocado type in your answer.
I think there is definitely an indication that avocado type does influence price. We can see that conventional avocados have a lower average price. We also see that despite organic avocados selling in lower volumes, they typically have a much higher average price, with some well above the norm.
2. Historical Data on Avocado Prices (Continued)
Question 2.1
{points: 3}
Now that we know the region that sold the cheapest avocados (on average) in 2018, which region sold the most expensive avocados (on average) in 2018? And for that region, what role might avocado type play in sales? Repeat the analysis you did above, but now apply it to investigate the region which sold the most expensive avocados (on average) in 2018.
Remember: we are finding the region that sold the most expensive avocados in 2018, but then producing a scatter plot of average price versus total volume sold for all years.
Name your plot object priciest_plot.
# your code here
#fail() # No Answer - remove if you provide an answer
expen <- avocado |>
filter(yr == 2018) |>
group_by(region) |>
summarize(average_price = mean(average_price, na.rm = TRUE)) |>
arrange(desc(average_price)) |>
slice(1)
expen
priciest_plot <- avocado |>
mutate(total_volume = small_hass_volume + large_hass_volume + extra_l_hass_volume ) |>
filter(region == "HartfordSpringfield") |>
ggplot(aes(x = total_volume, y = average_price)) +
geom_point() +
xlab("Total Volume") +
ylab("Average Price") +
scale_x_log10()
priciest_plot <- priciest_plot +
geom_point(aes(colour = type)) +
theme(text = element_text(size = 20))
priciest_plot
# check that plot has the correct name
test_that('scatter plot should be named priciest_plot', {
expect_true(exists('priciest_plot'))
})
print('plot has correct name')
# The remainder of the tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.
### BEGIN HIDDEN TESTS
properties <- c(priciest_plot$layers[[1]]$mapping, priciest_plot$mapping)
if(length(priciest_plot$layers) > 1){
properties2 <- c(priciest_plot$layers[[2]]$mapping, priciest_plot$mapping)
} else {
properties2 <- c(priciest_plot$layers[[1]]$mapping, priciest_plot$mapping)
}
test_that('priciest_plot should be a scatter plot', {
expect_true("GeomPoint" %in% c(class(priciest_plot$layers[[1]]$geom)))
})
test_that('Plot should only include data from the HartfordSpringfield region', {
expect_equal(unique(priciest_plot$data$region), 'HartfordSpringfield')
})
test_that('Data from the HartfordSpringfield region is missing', {
expect_equal(int_round(nrow(priciest_plot$data), 0), 338)
})
test_that('total_volume should be on the x-axis', {
expect_true(rlang::get_expr(properties$x) == 'total_volume')
})
test_that('average_price should be on the y-axis', {
expect_true(rlang::get_expr(properties$y) == 'average_price')
})
test_that('Labels on the axes should be descriptive and human readable.', {
expect_false((priciest_plot$labels$x) == 'total_volume')
expect_false((priciest_plot$labels$y) == 'average_price')
})
test_that('Colour should be mapped to type', {
expect_true('type' %in% c(rlang::get_expr(properties$colour),
rlang::get_expr(properties2$colour)))
})
print('Success!')
### END HIDDEN TESTS
Question 2.2
{points: 3}
In 2-3 sentences, describe what you see in the graph above for the region with the most expensive avocados (on average). Comment specifically on whether there is any evidence/indication that avocado type might influence price.
Hint: Make sure to include information about volume, average price, and avocado type in your answer.
priciest_plot
We can see a similar trend to Houston's graph. The organic avocados, despite selling in lower volumes, are typically sold at higher average prices, and conventional avocados still typically sell for a lower price. It is also worth noting that, compared to Houston's graph, the price difference between the two types is larger in HartfordSpringfield.
Question 2.3
{points: 3}
Plot the scatterplots for the two regions so that they are in adjacent cells (so it is easier for you to compare them). Compare the price and volume data across the two regions. Then argue for or against the following hypothesis:
"the region that has the cheapest avocados has them because it sells less of the organic (expensive) type of avocados compared to conventional cheaper ones."
# You can put your scatter plot code in this cell!
library(cowplot)
plot_grid(priciest_plot, avocado_plot, ncol = 2)
Write about your comparison and argument in this cell!
"the region that has the cheapest avocados has them because it sells less of the organic (expensive) type of avocados compared to conventional cheaper ones."
This statement simply isn't supported. The plots do not show the average price of ALL avocados sold combined, but rather the average price of each of the two avocado types separately, so the hypothesis above cannot be confirmed from these plots. Not to mention, the conventional avocados in the priciest region's graph show that prices there are just higher in general (look at the scale).
3. Sea Surface Salinity in Departure Bay
As mentioned in this week's Worksheet, Canada's Department of Fisheries and Oceans (DFO) compiled environmentally essential data from 1914 to 2018. The data was collected at the Pacific Biological Station (Departure Bay). Daily sea surface temperature (degrees Celsius) and salinity (practical salinity units, PSU) observations have been carried out at several locations on the coast of British Columbia. The number of stations reporting at any given time has varied as sampling has been discontinued at some stations, and started or resumed at others.
In Worksheet 3 we already worked with the temperature observations. Now, we will be focusing on salinity! Specifically, we want to see if the monthly maximum salinity has been changing over the years. We will only be focusing our attention on the winter months December, January and February.
Question 3.1
{points: 1}
To begin working with this data, read the file max_salinity.csv into R. Note, this file (just like the avocado data set) is found within the tutorial_03 folder.
Assign your answer to an object called sea_surface.
# your code here
#fail() # No Answer - remove if you provide an answer
sea_surface <- read_csv("data/max_salinity.csv")
sea_surface
Question 3.2
{points: 3}
Given that ggplot prefers tidy data, we must tidy the data! Use the pivot_longer() function to create a tidy data frame with three columns: Year, Month and Salinity. Remember we only want to look at the winter months (December, January and February) so don't forget to reduce the data to just those three!
Assign your answer to an object called max_salinity.
max_salinity <- sea_surface |>
select(Jan, Dec, Feb, Year) |>
pivot_longer(cols = -Year,
names_to = 'Month',
values_to = 'Salinity')
# your code here
#fail() # No Answer - remove if you provide an answer
max_salinity
# check that data frame has the correct name
test_that('data frame should be named max_salinity', {
expect_true(exists('max_salinity'))
})
print('data frame has correct name')
# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.
### BEGIN HIDDEN TESTS
test_that('max_salinity should have 3 columns and 315 rows', {
expect_equal(int_round(nrow(max_salinity), 0), 315)
expect_equal(int_round(ncol(max_salinity), 0), 3)
})
test_that('max_salinity should contain columns named Year, Month and Salinity', {
expect_equal(paste(sort(colnames(max_salinity)), collapse = ""), 'MonthSalinityYear')
})
test_that('incorrect data values in year column of max_salinity', {
expect_equal(int_round(sum(as.numeric(max_salinity$Year)), 0), 619290)
})
### END HIDDEN TESTS
Question 3.3
{points: 3}
Now that we've created new columns, we can finally create our plot that compares the maximum salinity observations to the year they were recorded. As usual, label your axes!
Assign your answer to an object called max_salinity_plot.
Hint: do not forget to add units to your axes! Remember from the data description that salinity is measured in practical salinity units (PSU).
# your code here
#fail() # No Answer - remove if you provide an answer
max_salinity_plot <- max_salinity |>
ggplot(aes(x = Year, y = Salinity)) +
geom_point() +
xlab("Year") +
ylab("Max Salinity (PSU)")
max_salinity_plot
# check that plot has the correct name
test_that('scatter plot should be named max_salinity_plot', {
expect_true(exists('max_salinity_plot'))
})
print('plot has correct name')
# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.
### BEGIN HIDDEN TESTS
properties <- c(max_salinity_plot$layers[[1]]$mapping, max_salinity_plot$mapping)
test_that('max_salinity_plot should be a scatter plot', {
expect_true("GeomPoint" %in% c(class(max_salinity_plot$layers[[1]]$geom)))
})
test_that('Data points from max_salinity data set is missing', {
expect_equal(int_round(nrow(max_salinity_plot$data), 0), 315)
})
test_that('Year should be on the x-axis', {
expect_true(rlang::get_expr(properties$x) == 'Year')
})
test_that('Salinity should be on the y-axis', {
expect_true(rlang::get_expr(properties$y) == 'Salinity')
})
### END HIDDEN TESTS
Question 3.4
{points: 3}
In 1-2 sentences, describe what you see in the graph above. Comment specifically on whether there is a change in salinity across time for the winter months and if there is, whether this indicates a positive or a negative relationship for these variables within this data set. If there is a relationship, also comment on its strength and linearity.
I think overall there has definitely been a change in salinity over time for the winter months.
The relationship is negative: as time goes on, the maximum salinity goes down.
I think the relationship is fairly strong. There are very few outliers, which indicates consistent data, and I believe there is some degree of linearity showing that as time increases, salinity decreases.
4. Pollution in Madrid
The goal of this analysis (which we started in worksheet_wrangling) is to see if pollutants are decreasing (is air quality improving) and also determine which pollutant has decreased the most over the span of 5 years (2001 - 2006). In worksheet_wrangling we investigated what happened with the maximum values of each pollutant over time, now we will investigate the average values of each pollutant over time. To do this we will:
- Calculate the average monthly value for each pollutant for each year.
- Create a scatter plot for the average monthly value for each month. Plot these values for each pollutant and each year so that a trend over time for each pollutant can be observed.
- Now we will look at which pollutant decreased the most between 2001 - 2006 when we look at the average instead of the maximum values.
Question 4.1
{points: 3}
To begin working with this data, read the file madrid_pollution.csv. Note, this file (just like the other data sets in this tutorial) is found in the tutorial_wrangling directory.
Assign your answer to an object called madrid.
# your code here
#fail() # No Answer - remove if you provide an answer
# despite the .csv extension, this file is tab-separated, hence read_tsv
madrid <- read_tsv("data/madrid_pollution.csv")
madrid
# check that data frame has the correct name
test_that('data frame should be named madrid', {
expect_true(exists('madrid'))
})
print('data frame has correct name')
# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.
### BEGIN HIDDEN TESTS
test_that('madrid should have 17 columns and 51864 rows', {
expect_equal(int_round(nrow(madrid), 0), 51864)
expect_equal(int_round(ncol(madrid), 0), 17)
})
test_that('incorrect data values in year column of BEN', {
expect_equal(int_round(sum(as.numeric(madrid$BEN), na.rm = TRUE), 0), 106936)
})
### END HIDDEN TESTS
Given that we are going to be plotting months, which are dates, let's tell R how they should be ordered. We can do this by changing the month column from a character vector to a factor vector. Factors in R are useful for categorical data, and their levels have an order.
# run this cell to order the column month by month (date) and not alphabetically
madrid <- madrid |>
mutate(month = factor(month, levels = c('January','February','March','April',
'May','June','July','August',
'September','October','November','December')))
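To see what the factor conversion buys us, here is a minimal sketch on a made-up month vector (not the madrid data):

```r
m <- c("March", "January", "February")

# character vectors sort alphabetically
sort(m)  # "February" "January" "March"

# a factor with explicit levels sorts in calendar order instead
m_fct <- factor(m, levels = c("January", "February", "March"))
sort(m_fct)  # January February March
```

ggplot uses the same level order when laying out a factor axis, which is why the cell above matters for the plots below.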
Question 4.2
{points: 3}
Calculate the average monthly value for each pollutant for each year and store that as a data frame. Your data frame should have the following 4 columns:
- `year`
- `month`
- `pollutant`
- `monthly_avg`
Name your data frame madrid_avg.
# your code here
#fail() # No Answer - remove if you provide an answer
madrid_avg <- madrid |>
pivot_longer(cols = BEN:TOL,
names_to = "pollutant",
values_to = "value") |>
group_by(year, month, pollutant) |>
summarize(monthly_avg = mean(value, na.rm = TRUE))
madrid_avg
# check that data frame has the correct name
test_that('data frame should be named madrid_avg', {
expect_true(exists('madrid_avg'))
})
print('data frame has correct name')
# The tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.
### BEGIN HIDDEN TESTS
test_that('madrid_avg should have 4 columns and 1008 rows', {
expect_equal(int_round(nrow(madrid_avg), 0), 1008)
expect_equal(int_round(ncol(madrid_avg), 0), 4)
})
test_that('incorrect data values in year column of monthly_avg', {
expect_equal(int_round(sum(as.numeric(madrid_avg$monthly_avg), na.rm = TRUE), 0), 22823)
})
test_that('incorrect data types in columns of madrid_avg', {
expect_equal(paste(sort(map_chr(madrid_avg, class)), collapse = ""), 'characterfactornumericnumeric')
})
### END HIDDEN TESTS
Question 4.3
{points: 3}
Create a scatter plot for the average monthly value for each month. Plot these values for each pollutant and each year so that a trend over time for each pollutant can be observed. To do this all in one plot, you are going to want to use a facet_grid layer (which makes subplots within one plot when data are "related") and a theme layer (to adjust the angle of the text on the x-axis). We provide you with the code for these two layers in the scaffolding for this plot.
options(repr.plot.width = 16, repr.plot.height = 18)
pollutant_labels <- c(BEN = "Benzene \n(μg/m³)",
CO = "Carbon \nmonoxide \n(mg/m³)",
EBE = "Ethylbenzene \n(μg/m³)",
MXY = "M-xylene \n(μg/m³)",
NMHC = "Non-methane \nhydrocarbons \n(mg/m³)",
NO_2 = "Nitrogen \ndioxide \n(μg/m³)",
NOx = "Nitrous \noxides \n(μg/m³)",
O_3 = "Ozone \n(μg/m³)",
OXY = "O-xylene \n(μg/m³)",
PM10 = "Particles \nsmaller than 10 μm",
PXY = "P-xylene \n(μg/m³)",
SO_2 = "Sulphur \ndioxide \n(μg/m³)",
TCH = "Total \nhydrocarbons \n(mg/m³)",
TOL = "Toluene \n(μg/m³)")
madrid_avg_plot <- madrid_avg |>
ggplot(aes(x = month, y = monthly_avg)) +
geom_point() +
xlab("Month") +
ylab("Monthly average (μg/m³)") +
facet_grid(rows = vars(pollutant), cols = vars(year), scales = "free",
switch = "y",
labeller = labeller(pollutant = pollutant_labels)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
strip.text.y.left = element_text(angle = 0),
text = element_text(size = 20))
# your code here
#fail() # No Answer - remove if you provide an answer
madrid_avg_plot
Question 4.4
{points: 3}
By looking at the plots above, which monthly average pollutant levels appear to have decreased over time? Which appear to have increased?
Toluene, P-xylene, M-xylene, carbon monoxide, and benzene appear to have decreased over time. One may argue that ethylbenzene has decreased over time too, since there are fewer peaks and the values are consistently lower than before.
Non-methane hydrocarbons appear to have increased. I make this observation based on the fact that 2001 and 2006 had very similar starting points, but 2001 had a much lower "valley", whereas 2006 remained consistently high. (Not good.)
Question 4.5
{points: 3}
Now we will look at which pollutant decreased the most between 2001 - 2006 when we look at the average yearly values for each pollutant. Your final result should be a data frame that has two columns (pollutant and yearly_avg_diff) and one row (the most decreased pollutant when looking at yearly average between 2001 - 2006). Make sure to use the original madrid data frame in your solution.
There are several different ways to solve this problem. Our solution used both pivot_longer and pivot_wider. See here for more info.
# your code here
#fail() # No Answer - remove if you provide an answer
madrid_diff <- madrid |>
filter(year %in% c(2001, 2006)) |>
pivot_longer(cols = BEN:TOL,
names_to = "pollutant",
values_to = "value") |>
group_by(year, pollutant) |>
summarize(year_avg = mean(value, na.rm = TRUE)) |>
pivot_wider(names_from = "year",
values_from = "year_avg",
names_prefix = "yr") |>
mutate(yearly_avg_diff = yr2006 - yr2001) |>
select(pollutant, yearly_avg_diff) |>
arrange(yearly_avg_diff) |>
slice(1)
madrid_diff
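One subtle pitfall worth flagging when filtering for two years: `==` compares element-wise and recycles its right-hand side, so `filter(year == c(2006, 2001))` silently keeps only the rows whose positions happen to line up, while `%in%` tests set membership, which is what we want. A minimal illustration on a made-up vector:

```r
years <- c(2001, 2001, 2006, 2006)

# == recycles c(2006, 2001) to c(2006, 2001, 2006, 2001) and compares pairwise
years == c(2006, 2001)    # FALSE  TRUE  TRUE FALSE

# %in% checks membership for every element
years %in% c(2006, 2001)  # TRUE TRUE TRUE TRUE
```

R may emit a recycling warning for `==` when lengths don't divide evenly, but with an even split like this it fails completely silently.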
Question 4.6
{points: 3}
Did using the average to find the most decreased pollutant between 2001 and 2006 give you the same answer as using the maximum in the worksheet? Is your answer to the previous question surprising? Explain.
We got a completely different answer. I think this was not a surprise, because when you use MAX values you tend to capture either erroneous data or "rare" extreme cases. Using the max suggested the most decreased pollutant was TOL, whereas using averages it was NOx. Switching from max to averages changed TOL's value from -178 to -9; that order-of-magnitude difference suggests the maxima do not reflect typical conditions.
Optional Question
(for fun and does not count for grades):
Consider doing the same analysis as you did for Question 4.5, except this time calculate the difference as a percent or fold difference (as opposed to the absolute difference we used in Question 4.5). The scales for the pollutants are very different, so we might want to take this into consideration when trying to answer the question "Which pollutant decreased the most?"
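Here is one possible sketch for this optional question (just one of several approaches): wrap the Question 4.5 pipeline in a function that computes percent change relative to the 2001 yearly average. The function name `pct_change_by_pollutant` and the toy data frame below are our own inventions for illustration, not part of the assignment.

```r
library(dplyr)
library(tidyr)

# Expects a long data frame with columns year, pollutant, value
# (the shape produced by pivot_longer in Question 4.5), restricted to 2001 and 2006.
pct_change_by_pollutant <- function(df) {
  df |>
    group_by(year, pollutant) |>
    summarize(year_avg = mean(value, na.rm = TRUE), .groups = "drop") |>
    pivot_wider(names_from = year, values_from = year_avg, names_prefix = "yr") |>
    mutate(pct_change = (yr2006 - yr2001) / yr2001 * 100) |>  # percent, not absolute
    arrange(pct_change)
}

# Toy example: pollutant A halves (-50%), pollutant B doubles (+100%)
toy <- tibble(
  year = c(2001, 2006, 2001, 2006),
  pollutant = c("A", "A", "B", "B"),
  value = c(10, 5, 2, 4)
)
pct_change_by_pollutant(toy)
```

On the real data you would call this on the long-format madrid data filtered to 2001 and 2006, which puts all pollutants on a comparable relative scale.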
# Your optional answer goes here
source("cleanup.R")